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Abstract 

We investigate different ways of generating approximate solutions to 
the inverse problem of pairwise Markov random field (MRF) model learn- 
ing. We focus mainly on the inverse Ising problem, but discuss also the 
somewhat related inverse Gaussian problem. In both cases, the belief 
propagation algorithm can be used in closed form to perform inference 
tasks. Our approach consists in taking the Bethe mean-field solution as 
a reference point for further perturbation procedures. We remark indeed 
that both the natural gradient and the best link to be added to a maxi- 
mum spanning tree (MST) solution can be computed analytically. These 
observations open the way to many possible algorithms, able to find ap- 
proximate sparse solutions compatible with belief propagation inference 
procedures. Experimental tests on various datasets with refined Lo or Li 
regularization procedures indicate that this approach may be a competi- 
tive and useful alternative to existing ones. 
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1 Introduction 

The problem at stake is a model selection problem, in the MRF families, where 
A'^ variables are observed pair by pair. The optimal solution is the MRF with 
maximal entropy obeying moment constraints. For binary variables, this hap- 
pens then to be the Ising model with highest log-likelihood. It is a difficult 
problem, where both the graph structure and the value of the fields and cou- 
pling have to be found. In addition, we wish to ensure that the model is com- 
patible with the fast inference algorithm "belief propagation" (BP) to be useful 
at large scale for real-time inference tasks. This leads us to look at least for a 
good trade-off between likelihood and sparsity. 

Concerning the Inverse Ising Problem (IIP), the existing approaches fall 
mainly in the following categories: 

• Purely computational efficient approaches rely on various optimization 
schemes of the log likelihood [21] or on pseudo-likelihood [18] along with 
sparsity constraints to select the only relevant features. 

• Common analytical approaches are based on the Plefka expansion [33] 
of the Gibbs free energy by making the assumption that the coupling 
constants Jij are small. The picture is then of a weakly correlated uni- 
modal probability measure. For example, the recent approach proposed 
in [7 is based on this assumption. 

• Another possibility is to assume that relevant coupling Jy have locally 
a tree-like structure. The Bethe approximation [37] can then be used 
with possibly loop corrections. Again this corresponds to having a weakly 
correlated uni-modal probability measure and these kind of approaches 
are referred as pseudo-moment matching methods in the literature for the 
reason explained in the previous section. For example the approaches 
proposed in [24l [35l [28] [36] are based on this assumption. 

• In the case where a multi-modal distribution is expected, then a model 
with many attraction basins is to be found and Hopfield-like models [IH] [5] 
are likely to be more relevant. To be mentioned also is a recent mean- 
field methods [35] which allows one to find in some simple cases the Ising 
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couplings of a low temperature model, i.e. displaying multiple probabilistic 
modes. 

On the side of inverse Gaussian problem, not surprisingly similar methods have 
been developed by explicit performing Lq and Li matrix norm penalizations on 
the inverse covariance matrices, so as to determine sparse non-zero couplings in 
estimated inverse covariance matrices for large-scale statistical inference appli- 
cations [inilSO] where direct inversion is not amenable. In our context the goal is 
a bit different. In general cases, the underlying inverse covariance matrix is not 
necessarily sparse. What we aim to find is a good sparse approximation to the 
exact inverse covariance matrix. Furthermore, sparsity constraint is not enough 
for constructing graph structure that is used in conjunction with BP, known 
sufficient conditions referred as walk-summability [26] (WS) are likely to be im- 
posed instead of (or in addition to) the sparsity constraint. To the best of our 
knowledge not much work taking this point into consideration at the noticeable 
exception of [2] by restricting the class of learned graph structures. To com- 
plete this overview, let us mention also that some authors proposed information 
based structure learning methods |30| quite in line with some approaches to be 
discussed in the present paper. 

In some preceding work dealing with a road traffic inference application, with 
large scale and real time specifications [121 EH H] j we have noticed that these 
methods could not be used blindly without drastic adjustment, in particular to 
be compatible with belief propagation. This led us to develop some heuristic 
models related to the Bethe approximation. The present work is an attempt to 
give a theoretical basis and firmer ground to these heuristics. 

2 Preliminaries 

2.1 Inverse Ising problem 

In this section we consider binary variables (x; £ {0, 1}), which at our conve- 
nience may be also written as spin variables Si = 2xi — 1 G {—1, !}• We assume 
that from a set of historical observations, the empirical mean (resp. covari- 
ance Xij) is given for each variable Si (resp. each pair of variable (s^, Sj)). In this 
case, from Jayne's maximum entropy principle j23| . imposing these moments to 
the joint distribution leads to a model pertaining to the exponential family, the 
so-called Ising model: 

^(s) = ^Tfhl ^^^(Yl + X] -^y^^^i) (2-1) 

i i,j 

where the local fields h = {hi} and the coupling constants J = {Jij} are the 
Lagrange multipliers associated respectively to mean and covariance constraints 
when maximizing the entropy of V. They are obtained as minimizers of the dual 
optimization problem, namely 

(h*, J*) = argminL[h, J], (2.2) 
(h,j) 
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with 



L[h, J] = log z[h, ^^^^ 



(2.3) 



the log likelihood. This leads to invert the linear response equations: 

dlogZ. 



dh, 
d\ogZ 

dJin 



[h,J] = 



(2.4) 



(2.5) 



rhij = rhirhj + Xij being the empirical expectation of SiSj. As noted e.g. in [7], 
the solution is minimizing the cross entropy, a KuUback-Leibler distance between 
the empirical distribution P based on historical data and the Ising model: 



DKLinv] ^ log z[h, J] 

i i<j 



S{V). 



(2.6) 



The set of equations ( 2.4|2.5 ) cannot be solved exactly in general because the 
computational cost of Z is exponential. Approximations resorting to various 
mean field methods can be used to evaluate Z[h, J]. 

Plefka's expansion To simplify the problem, it is customary to make use of 
the Gibbs free energy, i.e. the Legendre transform of the free energy, to impose 
the individual expectations m = {rhi\ for each variable: 

G[m, J] = h^(m)m + i^[h(m), J] 

(with F[h, J] = - log Z[h, J] , m is the ordinary scalar product) where h(m) 
depends implicitly on m through the set of constraints 



Note that by duality we have 



and 



dF 
dhi 

dG 
drrii 



r d'^G 1 




dh' 




dm 


-1 


r d^F 1 


-dmidnij . 




dm. 


ij 


I dhi 




-dhidhj. 



(2.7) 



(2.8) 



(2.9) 



i.e. the inverse susceptibility matrix. Finding a set of satisfying this last 



relation along with (2.8) yields a solution to the inverse Ising problem since the 
m's and x's are given. Still a way to connect the couplings directly with the 
covariance matrix is given by the relation 



dG 

dJ,,, 



(2.10) 
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The Plefka expansion is used to expand the Gibbs free energy in power of the 
coupHng Jij assumed to be smah. Multiplying all coupling J^- by a yields the 
following cluster expansion: 



G[m, aJ] = h"^(m, a)m + i^[h(m, a), aJ] 
= Go[m] + V— G„[m,J] 



(2.11) 
(2.12) 



where each term G„ corresponds to cluster contributions of size n in the number 
of links Jij involved, and h(m, a) depends implicitly on a in order to always ful- 
fill (2.7 1. This precisely is the Plefka expansion, and each term of the expansion 
(2.121 can be obtained by successive derivation of (2.11). We have 

1 + ?71i 



Go[m] =^ii;^log 



I - mi 1 - mi 
— 7, — log — - — 



Letting 



Hj ^ ^ JijSiSj ^ 



i<j 



using (2.7), the two first derivatives of (2.11) w.r.t a read 



dG[ni, a J] 

da 

d'^G[m,a3] , . Y^d/ii(m,a) , , 

= -Vara (iJj) - 2^ — Cova[Hj,Si), 



(2.13) 
(2.14) 



da^ ""^ '''' da 

where subscript a indicates that expectations, variance and covariance are taken 
at given a. To get successive derivatives of h(m,a) one can use (2.8). Another 
possibility is to express the fact that m is fixed. 



dmi 
da 







_ _ d dF[h{a),a3] 
da dhi 

= ^ h'j{a) Gov a{si, Sj) + Coy a{H. J, Si), 



givmg 



(2.15) 



where Xa is the susceptibility delivered by the model when a ^ 0. To get the 
first two terms in the Plefka expansion, we need to compute these quantities at 
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a = 0: 



Cov{Hj,Si) = ^Jijmj{l - m-), 
i 



(by convention Ja — in these sums). The first and second orders then finaUy 
read: 



G'i[m,J] = - y^J^jmiUij, 



G2[m, J] = - J] 4(1 - m2)(l - m|), 



and correspond respectively to the mean field and to the TAP approximation. 
Higher order terms have been computed in |17j . 

At this point we are in position to find an approximate solution to the inverse 
Ising problem, either by inverting equation (2.91 or (2.101. To get a solution at 
a given order n in the coupling, solving (2.10) requires G at order n + 1, while 
it is needed at order n in (2.9). 

Taking the expression of G up to second order gives 



dG 
dJii 



-miriij - Jij{l - m^)(l - m^). 



and (2.10) leads directly to the basic mean-field solution: 

MF _ Xij 



J, 



(1 - mf){l - m^) ' 



(2.16) 



At this level of approximation for G, using ( |2.8[ ) we also have 



1 1 + 

h^ = ^ log 

I 1 — nij 



which corresponds precisely to the TAP equations. Using now (2.9) gives 
dhi 



dnii 



Ignoring the diagonal terms, the TAP solution is conveniently expressed in terms 
of the inverse empirical susceptibility. 



•I. 



TAP 



1 + ^1 - 8mimj[x ' 



(2.17) 



where the branch corresponding to a vanishing coupling in the limit of small 
correlation i.e. small Xij and [x^^lij ^oi i ^ has been chosen. 
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Bethe approximate solution When the graph formed by the pairs for 
which the correlations Xij ^-re given by some observations is a tree, the following 
form of the joint probability corresponding to the Bethe approximation: 



^■^ p{Xi)p[Xj) 



(2.18) 



yields actually an exact solution to the inverse problem ( 2.2 ) , where the p are the 
single and pair variables empirical marginal given by the observations. Using 
the following identity 
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slO 



' Pi{s,)p,{sj) 2 



Plj 

2 ^pIp', 



logT^ 



,00 



P\, 

^0^0 



where the fohowing parametrization of the ^'s 



.7. dcf ^ / 1 



P 



xy dcf 



2 

1 + 



- =2^) = ^(l + ™,(2x-l)), 



1 



y 



^2 2 

(1 + rhi{2x - 1) + mj(22/ - 1) + rhij{2x - l){2y - 1) 



(2.19) 



(2.20) 



relating the empirical frequency statistics to the empirical "magnetizations" 
711 = 771, can be used. Summing up the different terms gives us the mapping 



onto an Ising model (2.1 1 with 



h^ = log # + 4 E ^^AW^ ) ' ^^-^^ 



(2.22) 



where di is the number of neighbors of i, using the notation j e 9i for "j 
neighbor of i" . The partition function is then explicitly given by 



ZBethe[p] = exp 



(2.23) 



The corresponding Gibbs free energy can then be written explicitly using (2.21 
2.23). With fixed magnetizations m^'s, and given a set of coupHngs {Jij}, the 



parameters rriij are implicit function 
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obtained by inverting the relations (2.22). For the Unear response, we get from 
alt I 

[r 



(|2.2l|) a result derived first in 

d, 



drrij 



ifi X! ((^11 + ^oi)(l 



dniik \ /I In/ dniik - 
dm, 'p^'^; p-j!' dm, ■ 



-((- 
16 Vpjj 



(9m,- 



1 



_ 9m^ 
^ ^ am,: 



)) 



Using (2.22), we can also express 





1 


1-1 1 - 


1 


dmij 








dmi 


1 




1 1 



so that with little assistance of Maple, we may finally reach the expression given 
in BH 



1 — m? 



E 



1 — 



(l-mf)(l-m2) 



(l~m,^)(l 



m2) 



Xife 



(2.24) 



equivalent to the original one derived in ,35|, albeit written in a different form, 
more suitable to discuss the inverse Ising problem. This expression is quite 
paradoxical since the inverse of the ['x\ij matrix, which coefficients appear on 
the right hand side of this equation, should coincide with the left hand side, 
given as input of the inverse Ising problem. The existence of an exact solution 
can therefore be checked directly as a self-consistency property of the input data 
Xij'- for a given pair (i, j) either: 



[x i]ij 7^ 0, then this self-consistency relation (2.24) has to hold and Jij 



is given by (2.22) using m 



■ Xij ■ 



• [x = then Jij = but y ^, , w hich can be non- vanishing, is obtained 
by inverting [x^^] defined by (2.24). 



Finally, complete consistency of the solution is checked on the diagonal ele- 



ments in (2.24). If full consistency is not verified, this equation can nevertheless 
be used to find approximate solutions. Remark that, if we restrict the set of 



equations (2.24 1, e.g. by some thresholding procedure, in such a way that the 
corresponding graph is a spanning tree, then, by construction, Xij = Xij will 
be solution on this restricted set of edges, simply because the BP equations are 
exact on a tree. The various methods proposed for example in [551 actually 
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correspond to different heuristics for finding approximate solutions to this set 
of constraints. As noted in |31j . a direct way to proceed is to eHminate Xij in 
the equations obtained from (2.221 and (2.24): 



Xl + ^Xtjim.mj - coth(2J,j)) + (1 - m'f){l - m^) = 



Xij 



X%] 



(1 - m\){\ - w?A = 0. 



This leads directly to 



' 2 V^l+4(l-m2)(l-m2)[x- 



112 



(2.25) 

while the corre spond ing computed of Xij , instead of the observed one Xij , has to 



be inserted in (|2.21|) to be fully consistent. Note that J^j'^*'^'^ and Jfj'^^ coincide 



at second order in [x ^ 



Hopfield model As mentioned in the introduction when the distribution to 
be modeled is multi-modal, the situation corresponds to finding an Ising model 
in the low temperature phase with many modes, referred to as Mattis states in 
the physics literature. Previous methods assume implicitly a high temperature 
where only one single mode, "the paramagnetic state" is selected. The Hopfield 
model, introduced originally to model auto-associative memories, is a special 
case of an Ising model, where the coupling matrix is of low rank p < N and 
corresponds to the sum of outer products of p given spin vectors j^*^, k — 1 . . .p}, 
each one representing a specific attractive pattern: 



1 ^ 

^ k=l 



In our inference context, these patterns are not given directly, the input of the 
model being the covariance matrix. In 8J these couplings are interpreted as 
the contribution stemming from the p largest principle axes of the correlation 
matrix. This lead in particular the authors to propose an extension of the 
Hopfield model by introducing repulsive patterns to take as well into account 
the smallest principal axes. Assuming small patterns coefficients j^*^! < 
they come up with the following couplings with highest likelihood: 



jHopfield 



1 



1 



1 



N-k 



at first order of the perturbation. At this order of approximation the local fields 
are given by 



hi — tanh(mi) — J.^ 



Hopfield 
ij ^ 
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In a previous study [IS] we found a connection between the plain direct BP 
method with the Hopfield model, by considering a 1-parameter deformation of 
the Bethe inference model ( |2.18[ ) 

^(x)=n(§%^)"n?5^(^')' (2-26) 

with a E [0,1]. We observed indeed that when the data corresponds to some 
multi-modal measure with well separated components, this measure coincides 
asymptotically with an Hopfield model made only of attractive pattern, repre- 
sentative of each component of the underlying measure, a represents basically 
the inverse temperature of the model and is easy to calibrate in practice. 



2.2 More on the Bethe susceptibiHty 



(a) 



(b) 



(d) 



(8) 



Figure 2.1: Various cumulant topologies of order three (a,b) and four (c-f). 



The explicit relation (2.24) between susceptibility and inverse susceptibility 



coefficients is not the only one that can be obtained. In fact, it is the specific 
property of a singly connected factor graph that two variables and Xj , con- 
ditionally to a variable Xk are independent if k is on the path between i and j 
along the tree: 



p{Xi,Xj,Xk) = p{Xi\xk)p{Xj\xk)p{xk) 



p{Xi,Xk)p{Xj,Xk) 

p{xk) 



1 2 ' 

1 - TO^ 



Using the paramctrization ( 2.19|2.2"0 ) with rriij — miirij+Xij yields immediately 

V fee along r. (2.27) 

[2^ . given the path zq = «, zi, . . . , in+i = 

nn 
a=0 Xiaia+l 



By recurrence we get, as noticed in e.g 
j between i and j along the tree T 



Xij 



n:=i(i-0' 
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reflecting the factorization of the joint measure. This expression actually coin- 
cides with (2.24) only on a tree. On a loopy graph, this last expression should 
be possibly replaced by a sum over paths. 

Higher order susceptibility coefficients are built as well in terms of elemen- 
tary building blocks given by the pairwise susceptibility coefficients Xij- The 
notations generalize into the following straightforward manner: 

m„fe = E{siSjSk) = m,mjmk + rmXjk + m^Xik + "ifcXij + Xijk 

rriijki = E{siSySkSi) = mimjmkmi 

+ mirrijXki + miUikXji + mimiXjk + mjmkXu + rUjiTiiXik + mkmiXij 

+ m-iXju + rnjXiki + m-kXiji + mXijk + XijkU 

where Xijk and Xijki are respectively three and four point susceptibilities. The 
quantities being related to the corresponding marg inals similarly to ( |2.19|2.20[ ): 

1 , 
o 

-I- rriijSiSj + rriikSiSk + m-jkSjSk + rriijkSiSjSk) 

p{si, Sj,Sk,si) = + + + ^'^kSk + misi 

+ rriijSiSj + rriikSiSk + muSiSi + m-jkSjSk + nijiSjSi + nikiSkSi 

+ rriijkSiSjSk + rriijiSiSjSi + mikiSiSkSi + rrijkiSjSkSi + rriijkiSiSjSkSi) 

Using the basic fact that, on the tree 

P{s.,s„sk)^ '^'-'^f'-''^ 
when J is on the path ik given by T, and 

p[S^,Sj,Sk) - ^ 

when path ij, ik and jk along T intersect on vertex Z, we obtain 

{-2 . ^\.2 ^tiX3iXki with {/} = (i, j) n (i, k) n (j, k) along T, 
-2nijXik if j e {i, k) along T. 

For the fourth order, m ore c ases have to be distinguished. When i, j, k and / 
are aligned as on Figure 2.1 c, in this order on the path il along T we have 

, X P{Si,Sj)p{Sj,Sk,Sl) 

p(Si,Sj,Sk,Sl) = - 



2 
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which leads to 

Xijki = 4:mkmjXii - XikXji ~ XiiXjk- 



When path ij, ik and jk along T intersect on vertex I as on Figure 2.1 d, we 
obtain instead El 

3mf - 1 
Xijki - 2 ^ _ XijXki- 

For the situation corresponding to Figure [2?T|e, we have 



P(^Si,Sj,Sk,Sl) = ^ 



p{SqY 

which leads to 

3to2 _ 1 
Xijki = 2 ^ _ ^2 XjjXfei- 



Finally, for the situation corresponding to Figure [2?T| f, we have 

.p{s^,Sj)p{Sj,Sk,Sl) 



P(^Si,Sj,Sk,Sl) = ^ • 

leading to 



2 



mkniq 



Xijki — XikXjl XjkXil ^ ^ -i 2'^*J-^'9' 

i — TOq 

2.3 Sparse inverse estimation of covariance matrix 

Let us leave the Ising modeling issue aside for a while and introduce another 
related graph selection problem, named sparse inverse covariance estimation, 
which is defined on real continuous random variables. This method aims at 
constructing a sparse factor graph structure by identifying conditionally inde- 
pendent pairs of nodes in the graph, given empirical covariances of random 
variables. Assuming that all nodes in the graph follow a joint multi-variate 
Gaussian distribution with mean /i and covariance matrix C, the existing cor- 
relation between the nodes i and j given all the other nodes in the graph are 
indicated by the non-zero ijth entry of the precision matrix C~^, while zero 
entries correspond to independent pairs of variables. Therefore, under the joint 
normal distribution assumption, selection of factor graph structures amounts to 
finding the sparse precision matrix that best describes the underlying data dis- 
tribution, given the fixed empirical covariance matrix. When the derived inverse 
estimation is sparse, it becomes easier to compute marginal distribution of each 
random variable and conduct statistical inference. To achieve that goal, opti- 
mizations methods have been developed based on Lq or Li norm penalty for the 
estimation of C~^, to enhance its sparse structure constraint on the estimated 
inverse of covariance matrix and discover underlying conditionally independent 
parts. 



^This apparently non-symmetric expression can be symmetrized with help of i2.27i 
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Let C G K"^" be the empirical covariance matrix of n random variables 
(represented as the nodes in the graph model) . The sparsity penalized maximum 
likelihood estimation A of the precision matrix C^^ can be derived by solving 
the following positive definite cone program: 

A = argmin -C{X) + \P{X) (2.28) 

where 

C{A) logdct(A) - Tt{AC), (2.29) 

is the log likelihood of the distribution defined by A, log det being the logarithm 
of the determinant, and P{A) is a sparsity inducing regularization term [13]. 
A is the regularization coefficient balancing the data-fitting oriented likelihood 
and sparsity penalty. Since the precision matrix of joint normal distribution 
should be positive definite, any feasible solution to this optimization problem 
is thus required to locate within a positive definite cone. The penalty term 
P{A) is typically constructed using sparsity inducing matrix norm, also known 
as sparse learning in the domain of statistical learning. There are two typical 
configurations of P{A): 

• The Lq norm || of the matrix X, which counts the number of non-zero 
elements in the matrix. It is also known as the cardinality or the non-zero 
support of the matrix. Given its definition, it is easy to find that Lq norm 
based penalty is a strong and intuitive appeal for sparsity structure of 
the estimated precision matrix. However, it is computationally infeasible 
to solve exact Lo-norm minimization directly, due to the fact that exact 
Lq norm penalty is discontinuous and non-differentiable. In practice, one 
either uses a continuous approximation to the form of the Lg-penalty, or 
solve it using a greedy method. Due to the non-convexity of the exact 
Lq norm penalty, only a local optimum of the feasible solution can be 
guaranteed. Nevertheless, Lq norm penalty usually leads to much sparser 
structure in the estimation, while local optimum is good enough for most 
practical cases. 

• The Li matrix norm 11^1]^^ = "^27] norm penalty was firstly 
introduced into the standard least square estimation problem by Tibshi- 
harni under the name "Lasso regression". Minimizing the Li norm 
based penalty encourages sparse non-zero entries in the estimated pre- 
cision matrix A, which achieves a selection of informative variables for 
regression and reduces complexity of regression model efficiently. Further 
extension of the basic Li-norm penalty function allows one assigning differ- 
ent non- negative weight values to different entries Aij, as -^ul^y I- 
This weighted combination can constrain the sparse penalties only on the 
off-diagonal entries, so as to avoid unnecessary sparsity on the diagonal 
elements. Furthermore, this extension allows us to introduce prior knowl- 
edge about the conditional independence structure of the graph into the 
joint combination problem. 
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Figure 2.2: Demonstration of Li norm, exact Lq norm and Seamless Lq norm 
based penalty fmiction. 



For further understanding of the relation between the exact Lq and Li norm 
penalty, we illustrate them with respect to one scalar variable in Figure [2?2] As 
we can see, within [—1, 1], Li-norm penalty plays as a convex envelope of the 
exact Lq-hothi penalty. Due to the convexity property of Li norm penalty, the 
global optimum of the convex programming problem can be achieved with even 
linear computational complexity (34l I13j . However, although Li norm based 
penalty leads to computationally sound solutions to the original issue, it also 
introduces modeling bias into the penalized maximum likelihood estimation. As 
illustrated in the figure, when the underlying true values of the matrix entries 
are sufficiently large, the corresponding Li norm based regularization performs 
linearly increased penalty to those entries, which thus results in a severe bias 
w.r.t. the maximum likelihood estimation jl2j . In contrast, Lq norm based 
penalty avoids such issue by constraining the penalties of non-zero elements to 
be constant. It has been proved in [33] that the Li norm penalty discovers 
the underlined sparse structure when some suitable assumptions are satisfied. 
However, in general cases, the quality of the solutions is not clear. 



3 Link selection at the Bethe point 

3.1 Bethe approximation and Graph selection 

As observed in the previous section, when using the Bethe approximation to 
find an approximate solution to the IIP, the consistency check should then be 
that either the factor graph be sparse, nearly a tree, either the coupling are 
small. There are then two distinct ways of using the Bethe approximation: 



the direct way, where the form of the joint distribution (2.18) is assumed 
with a complete graph. There is then a belief propagation fixed point for 
which the beliefs satisfy all the constraints. This solution to be meaning- 
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ful requires small correlations, so that the belief propagation fixed point 
be stable and unique, allowing the corresponding log likelihood to be well 
approximated. Otherwise, this solution is not satisfactory, but a pruning 
procedure, which amounts to select a sub-graph based on mutual infor- 
mation, can be used. The first step is to find the MST with these weights. 
Adding new links to this baseline solution in a consistent way is the subject 
of the next section. 

• the indirect way consists in first inverting the potentially non-sparse cor- 
relation matrix. If the underlying interaction matrix is actually a tree, 
this will be visible in the inverse correlation matrix, indicated directly 
by the non-zero entries. This seems to work better than the previous 
one when no sparsity but weak coupling is assumed. It corresponds in 
fact to the equations solved iteratively by the susceptibility propagation 
algorithm [28] . 

Let us first justify the intuitive assertion concerning the optimal model with 
tree like factor graphs, which was actually first proposed in j6j and which is 
valid for any type of MRF. 

Proposition 3.1. The optimal model with tree like factor graphs to the inverse 
pairwise MRF is obtained by finding the MST on the graph of weighted links 
with weights given by mutual information between variables. 

Proof. On a tree, the Bethe distribution relating marginal pi of single and pair 
Pij variables distributions 

is exact. Assuming first that the associated factor graph is given by a tree T, 



the expression (2.6) of the log likelihood leads to the optimal solution given that 
tree: 

Pi = Pi and ptj ^Pij, Vi S V, G T. 

In turn we have the following expression for the log likelihood: 

L= ^ hj-^Hi, 



with 



^_ dof _ log pi(xi) 



' p^{x^)pj(Xj)' 



respectively the single variable entropy and the mutual information between two 
variables. Since the single variable contributions to the entropy is independent 
of the chosen graph T, the optimal choice for T correspond to the statement of 
the proposition. ■ 
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3.2 Optimal 1-link correction to the Bethe point 

Adding one link Suppose now that wc want to add one link to the max 
spanning tree. The question is which Unk will produce the maximum improve- 
ment to the log likelihood. This question is a special case of how to correct a 

given factor for a general pairwisc model. Let Vq be the reference distribution 
to which we want to add (or modify) one factor V',:, to produce the distribution 

Pi(x)=Po(x)x ^^^'^^"^^'^ (3.1) 



with 

The log likelihood corresponding to this new distribution now reads 



Li = Lo + J dx'P{x)\ogtpij{xi, Xj) - logZ^. 



The functional derivative w.r.t. tp reads 

<9Li _ p(.T.,,.Tj-) _ p1j{xi,Xj) 

^'^ij (^i ) ) "^ij (^^ ) ) Z-f^j 

which yield immediately the solution 



\/{xi, Xj) £ fl^ 



^ij{xi,Xj) = ^^^f^^ with Z^ = l, (3.2) 

Pij{Xi, Xj) 

where po(-Ti,.Tj) is the reference marginal distribution obtained from Vq. The 
correction to the Log likelihood simply reads 

Ah = DKL(pij\\p'ij). (3.3) 

Sorting all the links w.r.t. this quantity yields the (exact) optimal 1-link correc- 
tion to be made. 

As a check, consider the special case of the Ising model. Adding one link 
amounts to set one to some finite value, but since this will perturb at least 
the local magnetization of i and j, we have also to modify the local fields hi and 
hj. The modified measure therefore reads 

_ exp{JijS,Sj + h,Si + hjSj) 

AZ(J, h) 

where AZ[J, h) is multiplicative correction factor to the partition function. We 
have 

AZ{hi, hj, Jij) = zo + zirhi + Z2rhj + z^mf^^^^, 
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after introducing the following quantities 





def 


^Jij+hi+hj 


+ e- 


■Jij 


— hi-\-hj 


+ e- 


■Jij+hi 


-hj 


+ e-'^' 


-hi 


-hj 


Z\ 


def 


^Jij+hi+hj 


— e~ 


■Jij 


— hi-\-hj 


+ e- 


'Jij ~\~hi 


-hj 


- e'^'^ 


-hi 


-hj 


Z2 


dot 


gJij+hi+hj 


+ e- 


■Jij 


— hi-\-hj 


— e~ 


■Jij+hi 


-hj 


- e'^'^ 


-hi 


-hj 


Z2, 


det 




— e~ 




— hi-\-hj 


— e~ 


-Jij+hi- 


-hj 


+ e-^'^' 


— hi 


-hj 



The correction to the log likelihood is then given by 

AL(/ii, hj, Jij) = log AZ(/ii, hj, Jij) - hiihi - hj-rhj - Jijin^j (3.4) 

This is a concave function of hi, hj and J,j, and the (unique) maximum is 
obtained when the following constraints are satisfied: 

d log(AZ) zi + zorhi + zsrhj + z^mf^^'^'' 



dhi 


AZ{hi,hj, Jij) 


aiog(Az) 


Z2 + z^rhi + zorhj + zimf^^^'' 


dhj 


AZ{hi,hj, Jij) 


aiog(Az) 


zs + Z2rhi + zirhj + ZQmfj'*'^'' 



dJij AZ{hi,hj,Jij) 
This constraints can be solved as follows. First let 

U {l + mi + rhj + ml'^^'^) e^^^^^^i) , 

V = (1 - mi + rhj - mf;'^'') e^C'^--^^), 

W = (l + rhi- vij - mf;'^^) e^C'--^-). 
to obtain the linear system 

U{1 - rrii) - V{1 + fhi) + W{1 - mi) = (1 - rn, - rhj + mf/*^''){l + m,), 

U{1 - rhj) + V{1 - Thj) - W{1 + rhj) = {l-rhi- rhj + m^^*'''){l + rhj), 

U{1 - rhij) - V{1 + rhij) - W{1 + rhij) = -(1 - - rhj + mf^'^''){l - rhij). 

Inverting this system yields the following solution 

{1-rhi- rhj + mf^/'"'") 
U = — j-^ ^ ^ — ^ — £ — ^ — (1 + rhi-\- rhj + rhij) 



(1- 


rhi — 


rhj + mf/''^''") 


(1 


- rhi 


— rhj + rhtj) 


(1- 


rhi — 


rhj + mf^"'") 


(1 


- rhi 


— rhj + rhi J ) 


(1- 


rhi - 


rrij + mff^'''') 



V = —- — -4 — ^ (1 - mj + rhj - rhij) 



W = —- \ — , ? . (1 + rhi- rhj - rhij) 

[l - rrii - rrij + rriij) 
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Using the parametrization (2.20), we finally arrive at 



^hiSi-\-hj Sj-\-J.ijSiSj 



P^J{S^, Sj) ( 1)6,, (-1, 1)&„(1, -l)6,j(-l, -1) V'^ 



hj{s,,Sj) VKj(1, 1)p,j(-1, l)p,j(l, -l);5jj(-l, -1) 



where bij(si,Sj) is the joint marginal of Si and Sj, obtained from the Bethe 
reference point. From these expressions, we can assess for any new potential 
link the increase (3.4 1 in the log likelihood. After rearranging all terms, it takes 
indeed as announced the following simple form 

AL(/i„/i„ J„) = ^K,-log f-'i'"'^j ^ DKLipWb)- (3.5) 

The interpretation is therefore immediate: the best candidate is the one for 
which the Bethe solution yields the most distant joint marginal bij to the tar- 
geted one Pij given by the data. Note that the knowledge of the {bij, (ij) ^ T} 
requires a sparse matrix inversion through equation (2.24 1, which renders the 
method a bit expensive in the Ising case. For Gaussian MRF, the situation is 
different, because in that case the correction to the log likelihood can be eval- 
uated directly by another means. Indeed, the correction factor (3.2 1 reads in 
that case 

^ij{x^,Xj) ^ exp(^-^{x^,XJ){[C'^]~^ - [C'^]-^){x^,Xj fy 

where [C'^^] and [C*^] represent the restricted 2x2 covariance matrix corre- 
sponding to the pair {xi, Xj) of respectively the reference model and the current 
model specified by precision matrix A — C^^ . With a small abuse of notation 
the new model obtained after adding or changing link [ij) reads 

A' = A + [C'^]-^ - [C^]-^ = A + V. 
with a log likelihoods variation given by: 

A J ^ii^jj + ^jj^ii ~ '^CijCij ^ii^jj ~ ^ij 

— 7^~r< 7^ ^ - iog 7^ ■ 

Let us notice the following useful formula (see e.g. [3]): 

{A + [V^])-^ = A-^ - A-^[V'^]{1 + A-^[V'^])-^A-^ 

= A-^ - A-^[V'^]{1 + [C''][V'^])-^A-\ (3.6) 

valid for a 2 x 2 perturbation matrix [V''^]. Using this formula, the new covari- 
ance matrix reads 

C" = A'-^ = A-^ - A-^[C'^]-\1 - [C'^][C'^]-'^)A-^. (3.7) 
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Therefore the number of operations needed to maintain the covariance matrix 
after each add-on is 0{N^). 

Let us now examine under which condition adding/modifying hnks in this 
way let the covariance matrix remain positive semi-definite. By adding a 2 x 2 
matrix, we expect a quadratic correction to the determinant: 

det(A') = det(A) dct(l + A-^V) 

= det(A) 1 + VuCu + VijCj, + Vj^Cij + VjjCjj 

+ {ViiVjj — VtjVji) {CiiCj-j - CijCji) , 

= detAx^Mai 

dct([C»J]) 

which is obtained directly because A~^V has non zero entries only on column i 
and j. Multiplying V by some parameter a > 0, define 



P{a) = det(l + aA-^V) = a' Aei{[C''][C'']'^ 



a 



When increasing a from to 1, P{a) will vary from 1 to det([C*^])/ det([C"'']) 
without canceling at some point iff [C*-'] [C'*-']"^ is definite positive. P{a) is 
proportional to the characteristic polynomial of a 2 x 2 matrix [C*-']/[C'*-']~^ 
of argument (a — 1)/q;, so A' remains positive definite if [C*^ ] [C'*-' ] ~ ^ is defi- 
nite positive. Since the product of eigenvalues given by det([C"-'])/ det([C'*^]) is 
positive, one has to check for the sum, given by the trace of [C"-']/[C'' 

C'iiCjj -\- CjjCii ICijCij > 0. 

Since [C*^] and [&^] are individually positive definite, we have 

and 

from which we deduce that 



(3.8) 



C-iiCjj C'ij > 



CiiCjj — Cfj > 



' CiiCjj + CjjCii\^'^ A ^ A ^ 2A2 



giving finally that (3.8) is always fulfilled when both [C"-'] and [C*-'] are non- 
degenerate. 

Each time a link is added to the graph, its number of loops increases by one 



unit, so in a sense (3.3) represent a 1-loop correction to the bare Bethe tree 
solution. 



Removing one link To use this in an algorithm, it would also be desirable 
to be able to remove links as well, such that with help of a penalty coefficient 
per link, the model could be optimized with a desired connectivity level. 
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For the Gaussian model, if A is the coupHng matrix, removing the hnk 
amounts to chose a factor ipij in (3.1 1 of the form: 

(xi and assumed centered as in the preceding section). Again, let V 

denote the perturbation in the precision matrix such that A' — A + V is the 
new one. The corresponding change in the log likelihood then reads 

AL = log det(l + A-^V) - Tt{VC). 

Arranging this expression leads to 

AL ^ log(l - 2A,^Q, - A% det([C'J"])) + 2A,,Q,. 



Using again (3.6) we get for the new covariance matrix 



C = C 



1 QiA'ij Cij 



Af.det([C'^]) 



C 



1 



1 Aij Cij 

Aij Cii 



C, (3.9) 



with again a slight abuse of notation, the 2x2 matrix being to be understood 
as a, N X N matrix with non-zero entries corresponding to (i, i), (j, i) and 

{j,j). To check for the positive-definiteness property of A', let us observe first 
that 

det(A') = det(A) x P{A,j), 

with 

Fix) = (1 - xia, - ,/c~c~))ii - x(Cy + ya^c")). 

When X varies from to Aij, P{x) should remains strictly positive to insure 
that A' is definite positive. This results in the following condition: 

1 1 

< Ay < 



Cij yJCii Cj 



y/CiiCjj + Ci- 



We are now equipped to define algorithms based on addition/deletion of links. 



3.3 Imposing walk-summability 

Having a sparse GMRF gives no guaranty to its compatibility with belief prop- 
agation. In order to be able to use the Gaussian Belief Propagation (GaBP) 
algorithm for performing inference tasks, stricter constraints have to be im- 
posed. The more precise condition known for convergence and validity of the 
GaBP algorithm is called walk-summability (WS) and is extensively described 
in [26]. The two necessary and sufficient conditions for WS that we consider 
here are: 

(i) Diag(A) - \R{A)\ is definite positive, with R{A) = A - Diag(A) the off- 
diagonal terms of A. 

(ii) The spectral radius p{\R'{A)\) < 1, with R'{A),j = -^Mh^. 
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Adding one link Using the criterion developed in the previous section sup- 
pose that in order to increase the likelihood we wish to add a link to the 
graph. The model is modified according to 

A' + [C'^]-^ - [C^]-^ = A + V. 

We assume that A is WS and we want to express conditions under which A' is 
still WS. 

Using the definition (i) of walk summability and mimicking the reasoning 
leading to (3.8) we can derive a sufficient condition for WS by replacing A with 
W{A) = Diag(^) - \R{A)\. It yields 

det {W{A + aV)) = det (WiA)) det (l + aWiAy^WiV)) 

= det (W) [l + a {wrWu + W-^V,, - 2Wr^'\V.,\} 

= det(T4^)Q(a), 

since R{A)ij = and shortening in W. A sufficient condition for WS of 

A' is that Q does not have any root in [0, 1]. Note that checking this sufficient 
condition imposes to keep track of the matrix (Diag(yl) — = W^^ IW 

and requires O(iV^) operations at each step, using ( |3.6| ). A more compact 
expression of Q is 

Q{a) = 1 + a Tr {lW'm{V)) + det {W{V)) det (IW^) , 

First let's tackle the special case where det (W{V)IW^^) = 0, the condition for 
WS of A' is then 

TT{IW''m{V)) > -1. 

Of course if the roots are not real, i.e. Tt{IW''^W{V))^ < 4 det {WiV)IW'^), 
A' is WS. If none of these conditions is verified we have to check that both roots 

- Tt{IW'^W{V) ± ^/Tt{IW^3W{V))^ - 4 det {W{V)IWi3) 
2dct {W{V)IW^3) ' 

are not in [0,1]. 



Modifying one link This is equivalent to adding one link in the sense that (3.11 
and (3.2) are still valid. If we want to make use of (i) the only difference is that 
R{A)ij is not zero before adding V so W{A + aV) = W{A) + aW{V) does not 
hold in general. Instead we have 

W{A + aV) = W{A) + (l}{aV, A), 

with 
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So we can derive a condition for A' to be WS using, as for the link addition, 

det{W{A + aV)) = det (W) det (l + W-^cj^iaV, A)) 
= det(VK)e(a) 

But now Q{a), the equivalent of (5(a), is a degree 2 polynomial only by parts. 
More precisely, if aVij—Aij > we have 6(a) = Qp{a) and else 9(a) = Qm{o) 
with both Qp and (5„j degree 2 polynomial. So by checking the sign of aVij—Aij 
and the roots of Qp and Qm we have sufficient conditions for WS after modifying 
one link. 

Another possible way for both adding or modifying one link is to estimate 
the spectral radius of |i?'(A')| through simple power iterations and concludes 
using (ii). Indeed if a matrix M as a unique eigenvalue of modulus equal to its 
spectral radius then the sequence 

(6fe,M5fc) Mbk 
{bk,bk) \\Mbk\\ 

converges to this eigenvalue. While the model remains sparse, with connectiv- 
ity K, a power iteration of R'{A') requires 0{KN) operations, and it is then 
possible to conclude about the WS of A' in 0{KN). Keeping track oiW(A)-^, 
which requires 0{N'^) operations, at each step is not needed anymore but we 
have to test WS for each possible candidate link. Note that computing the 
spectral radius gives us a more precise information about the WS status of the 
model. 



Removing one link Removing one link of the graph will change the matrix 
A in A' such as |i?'(A')| < |i?'(v4)| where the comparison (<) between two 
matrices must be understood as the element-wise comparison. Then dealing 
with positive matrices elementary results gives us p{\R'{A')\) < p{\R'{A)\) and 
thus removing one link of a WS model provide a new WS model. 



3.4 Greedy Graph construction Algorithms 
Algorithm 1: Incremental graph construction by link addition 

51 INPUT: the MST graph, and corresponding covariance matrix C. 

52 : select the link with highest AL compatible with the WS preserving 



condition of A' in the Gaussian case. Update C according to (3.7 1 for the 
Gaussian model. For the Ising model, C is updated by first running BP to 
generate the set of beliefs and co-beliefs supported by the current factor 



graph, which in turn allows one to use (2.24 1 to get all the missing entries 
in C by inverting x~^- 

S3 : repeat S2 until convergence (i) or until a target connectivity is reached (ii) 
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S4 : if (ii) repeat S2 until convergence by restricting the link selection in the 
set of existing ones. 

The complexity is respectively 0{N^) both for the Gaussian and the Ising model 
in the sparse domain where 0{N) links are added and respectively 0{N'^) and 
0{N^) in the dense domain. Indeed, the inversion of hi the Ising case costs 
0{N^) operations as long as the factor graph remains sparse, but 0{N^) for 
dense matrices. 



Algorithm 2: Graph surgery by Unk addition/deletion 

51 INPUT: the MST graph, and corresponding covariance matrix C, a link 
penalty coefficient v. 

52 : select the modification with highest AL — siy, with s = +1 for an addition 
and s = — 1 for a deletion, compatible with the WS preserving condition 



oi A' in the Gaussian case. Update C according to (3.7) and (3.9) respec- 
tively for an addition or a change and a deletion for the Gaussian model. 
For the Ising model, C is updated by first running BP to generate the set 
of beliefs and co-beliefs supported by the current factor graph, which in 



turn allows one to use (2.24) to get all the missing entries in C by inverting 

S3 : repeat 82 until convergence. 

In absence of penalty {v = 0) the algorithm will simply generate a model for 
all different mean connectivities, hence delivering an almost continuous Pareto 
set of solutions, with all possible trade-off between sparsity and likelihood as 
long as walk summability is satisfied. 

Instead, with a fixed penalty the algorithm is converging toward a solution 
with a connectivity depending implicitly on v; it corresponding roughly to the 
point K* where the slope is 

If we want to use the backtracking mechanism allowed by the penalty term 
without converging to a specific connectivity, we may also let v be adapted 
dynamically. A simple way is to adapt v with the rate of information gain by 
letting 

= rjACadd, with 77 e [0, 1[, 

where ACadd corresponds to the gain of the last link addition. With such a 
setting v is always maintained just below the information gain per link, allow- 
ing thus the algorithm to carry on toward higher connectivity. Of course this 
heuristic assumes a concave Pareto front. 
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4 Perturbation theory near the Bethe point 



4.1 Linear response of the Bethe reference point 

The approximate Boltzmann machines described in the introduction are ob- 
tained either by perturbation around the trivial point corresponding to a model 
of independent variables, the first order yielding the Mean-field solution and 
the second order the TAP one, either by using the linear response delivered in 
the Bethe approximation. We propose to combine in a way the two procedures, 
by computing the perturbation around the Bethe model associated to the MST 
with weights given by mutual information. We denote by T C £, the subset 
of links corresponding to the MST, considered as given as well as the suscep- 



tibility matrix [xBet/ie] given explicitly by its inverse through (2.24), in term 
of the empirically observed ones x- Following the same lines as the one given 
in Section [2] we consider again the Gibbs free energy to impose the individual 
expectations m — {rhi} given for each variable. Let J^^'^^^ = {Kij, G T} 
the set of Bethe-Ising couplings, i.e. the set of couphng attached to the MST 
s.t. corresponding susceptibilities are fulfilled and J = {J^ , £ S} a. set of 
Ising coupling corrections. The Gibbs free energy reads now 

G[m, J] = h'^(m)m-l-F[h(m), J'^'^""' -f J] 



where h(m) depends implicitly on m through the same set of constraints (2.71 
as before. The only difference resides in the choice of the reference point. We 
start from the Bethe solution given by the set of coupling J^^^*'"^ instead of 
starting with an independent model. 

The Plefka expansion is used again to expand the Gibbs free energy in power 
of the coupling Jij assumed to be small. Following the same lines as in Sec- 
tion [2?l] but with Go now replaced by 



GBethe[m] = h^(m)m - log Zsethe [h(m), J^-^*"^] , 

and hi, J^^*'''^ and Z Bethe given respectively by ( 2.2l|2.22|2.23 ) where £ is now 
replaced by T, letting again 



H — JijSiSj, 



i<j 

and following the same steps ( 2.13|2.14|2.15 1 leads to the following modification 
of the local fields 



^Bethe 

3 



Yy^Betheh C0VBetheiH\s^) Vl G V 



to get the following Gibbs free energy at second order in a (after replacing 
by aH^): 

G[m,aJ] = GBet/ie(m) - aEBethe{H^) 
2 

-^(VarBet/ie(-ff^) - ^[XBltheh COY B ethe{H^ , S i) COY B ethe{H^ j S j)^ +o{a^). 
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This is the general expression for the hnear response near the Bethe reference 
point that we now use. 



Gbm[J] - -EBethe(ff') (4.1 

- ^(Va.TBethe{H^) - y^iXslthelij GoV BethejH^ , Sj) Cov BethejH'^ , S j) 
id 

(4.2) 

represents the Gibbs free energy at this order of approximation. It it is given 
exphcitly through 

'^BetheiH^) = ^ JijUlij 

i<j 

^^arBetheiH^) = ^ JijJki{mijki - mijiTiki) 

i<j,k<l 

GovBethe{H^, Sk) = ^ Jij{mijk ~ m.ijmk) 

where 

m,jk = Esethe(siSjSfe), m^jkl = EBethe{SiSjSkSl) 

are the moments dehvered by the Bethe approximation. With the material given 
in Section 2.2 these are given in closed form in terms of the Bethe susceptibility 
coefficients XBethe- Concerning the log-likelihood, it is given now by: 

L[J] = -GBetheim) - Gblr[J] - Y.^JF,"''"' + J^J)mJ + o{J^). (4.3) 

Gblr is at most quadratic in the J's and contains the local projected Hessian 
of the log likelihood onto the magnetization constraints (2.7 1 with respect to 
this set of parameters. This is nothing else than the Fisher information matrix 
associated to these parameter J which is known to be positive-semidefinite, 
which means that the log- likelihood associated to this parameter space is convex. 
Therefore it makes sense to use the quadratic approximation (4.3) to find the 
optimal point. 



4.2 Line search along the natural gradient in a reduced 
space 

Finding the corresponding couplings still amounts to solve a linear problem of 
size N'^ in the number of variables which will hardly scale up for large system 
sizes. We have to resort to some simplifications which amounts to reduce the 
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size of the problem, i.e. the number of independent couphngs. To reduce the 
problem size we can take a reduced number of link into consideration, i.e. the 
one associated with a large mutual information or to partition them in a way 
which remains to decide, into a small number q of group Qj^,v = 1^ . . .q. Then, 
to each group v is associated a parameter with a global perturbation of the 
form 

q 

1^=1 

where each H^, involves the link only present in Q^. 



SiSj 



and the Jij are fixed in some way to be discussed soon. The corresponding con- 
straints, which ultimately insures a max log-likelihood in this reduced parameter 
space are then 



This leads to the solution: 



v=l 

where the Fisher information matrix T has been introduced and which reads in 
the present case 

— [C0VBethe[H^, H^) — ^ [XBlthe\iJ Cov^et/te (^^/i , Si) COYBethe{Hui Sj)] 

(4.4) 

The interpretation of this solution is to look in the direction of the natural 
gradient [2 13] of the log likelihood. The exact computation of the entries of the 
Fisher matrix involves up to 4th order moments and can be computed using 



results of Section 2.2 At this point, the way of choosing the groups of edges 
and the perturbation couplings Jij of the corresponding links, leads to various 
possible algorithms. For example, to connect this approach to the one proposed 
in Section [X2| the first group of links can be given by the MST, with parameter 



ao and their actual couplings Jij — J^^'^*'"^ at the Bethe approximation; making 
a short list of the q—1 best links candidates to be added to the graph, according 
to the information criteria [3^ defines the other groups as singletons. It is then 
reasonable to attach them the value 

1 T^OO 7)01 10 

T — Intr 

4 ^ tP^ ' 

-Pjj -Pij I'lj I'lj 



of the coupling according to (3.2), while the modification of the local fields as a 
consequence of (3.2) can be dropped since the Gibbs free energy take it already 
into account implicitly, in order to maintain single variable magnetization nii = 
rhi correctly imposed. 
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4.3 Reference point at low temperature 



Up to now we have considered the case where the reference model is supposed to 
be a tree and is represented by a single BP fixed point. From the point of view 
of the Ising model this corresponds to perturb a high temperature model in the 
paramagnetic phase. In practice the data encountered in applications are more 
likely to be generated by a multi-modal distribution and a low temperature 
model with many fixed points should be more relevant. In such a case we 
assume that most of the correlations are already captured by the definition of 
single beliefs fixed points and the residual correlations is contained in the co- 
beliefs of each fixed point. For a multi-modal distribution with q modes with 
weight Wk^ k = 1 . . . q and a pair of variables (s;, Sj) we indeed have 



9 9 



X^J = y_^Wk Cov(s„s,|fc) + > ?«fe(E(s,|fc)-E(s,))(E(s,|fc)-E(s,)) 



fc=l k=\ 



— ^ ^mira I inter 



where the first term is the average intra cluster susceptibility while the sec- 
ond is the inter cluster susceptibility. All the preceding approach can then 
be followed by replacing the single Bethe susceptibility and higher order mo- 
ments in equations ( 4.1|4.4 1 in the proper way by their multiple BP fixed point 



counterparts. For the susceptibility coefficients, the inter cluster susceptibility 
coefficients ^"^^^"^ are given directly from the single variable belief fixed points. 
The intra cluster susceptibilities x'^ are treated the same way as the former 
Bethe susceptibility. Th is me ans that the co-beliefs of fixed points fc G {1, . . . (7} 



are entered in formula (2.24) which by inversion yields the x'^'^i these in turn 
leading to t^™*'"'^ by superposition. Higher order moments are obtain by sim- 
ple superposition. Improved models could be then searched along the direction 
indicated by this natural gradient. 



5 Lq norm penalized sparse inverse estimation 
algorithm 

We propose here to use the Doubly Augmented Lagrange (DAL) method [551 
nn [TU] to solve the penalized log-determinant programming in ( |2.28 1. For a 
general problem defined as follows: 

minF(a;) = /(x) +.g(a;) (5.1) 

X 

where ]{x) and g(x) are both convex. DAL splits the combination of j(x) 
and g{x) by introducing a new auxiliary variable y. Thus, the original convex 
programming problem can be formulated as : 

minF(a::) = /(a;) + g{y) 

^'^ (5.2) 
s.t. X — y ^ 
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Then it advocates an augmented Lagrangian method to the extended cost func- 



tion in (5.2). Given penalty parameters ^ and 7, it minimizes the augmented 



Lagrangian function 

L{x,y,i^,x,y) = f{x)+g{y) + {iy,x-y) + '^\\x-y\\l + ^\\x-S:\\l + ^\\y-y\\l (5.3) 

where x and y are the prior guesses of x and y that can obtained either from a 
proper initiahzation or the estimated resuh in the last round of iteration in an 
iterative update procedure. Since optimizing jointly with respect to x and y is 
usually difficult, DAL optimizes x and y alternatively. That gives the following 
iterative alternative update algorithm with some simple manipulations: 

= mm fix) + - + D'^Wl + "A\x- 

y^+' = mmg{y) + ^\\x'^+' y + + - y^Wl (5.4) 
y Z Z 

where v = ^v. As denoted in [TU] and [55], DAL improves basic augmented 
Lagrangian optimization by performing additional smooth regularization on es- 
timations of X and y in successive iteration steps. As a result, it guarantees not 
only the convergence of the scaled dual variable j>, but also that of the proximal 
variables x^ and y^ ^ which could be divergent in basic augmented Lagrangian 
method. 

We return now to the penalized log-determinant programming in sparse 



inverse estimation problem, as seen in (2.281. The challenge of optimizing the 



cost function is twofold. Firstly, the exact Lo-norm penalty is non-differentiable, 
making it difficult to find an analytic form of gradient for optimization. Further- 
more, due to the log-determinant term in the cost function, it implicitly requires 
that any feasible solution to the sparse approximation A of the precision matrix 
should be strictly positive definite. The gradient of the log-determinant term 
is given by (7 — Ar^ ^ which is not continuous in the positive definite domain 
and makes it impossible to obtain any second-order derivative information to 
speed up the gradient descent procedure. We hereafter use S'+_|_ as the symmet- 
ric positive definite symmetric matrices that form the feasible solution set for 



this problem. By applying DAL to the cost function (2.28), we can derive the 
following formulation: 

J{A, Z, A, Z,iy)=- log det(A) + Tt{CA) + XP{Z) + {v,A- Z) 

+ ^\\A-Z\\l + l\\A-A\\l + l\\Z-Z\\l (5.5) 



s.t. A, Z e 5_ 



++ 



where Z is the auxiliary variable that has the same dimension as the sparse 
inverse estimation A. A and Z are the estimated values of A and Z derived in 
the last iteration step. The penalty parameter 7 controls the regularity of A 
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and Z. By optimizing A and Z alternatively, the DAL procedure can be easily 
formulated as an iterative process as follows, for some (5 > 0: 



A^+i = argmin- logdet(yl) + llx{CA) + \P{Z^) + (^/^ A - Z^) 

A 

+ t^\\A~Z^\\l + l\\A^A^\\l 

Z^+i = argminAP(Z) + {v\A^+' - Z) + ^^\\A^+^ - Z\\l 

z 2 (5.6) 



2' 



2 



s.t. A^+\Z^+^ e 5++ 

By introducing the auxiliary variable Z ^ the original penalized maximum 
likelihood problem is decomposed into two parts. The first one is composed 
mainly by the convex log-determinant programming term. Non-convex penalty 
is absorbed into the left part. Separating the likelihood function and the penalty 
leads to the simpler sub-problems of solving log-determinant programming using 
eigenvalue decomposition and Lq norm penalized sparse learning alternatively. 
Each sub-problem contains only one single variable, making it applicable to call 
gradient descent operation to search local optimum. Taking v = -v, we can 
derive the following scaled version of DAL for the penalized log-cfeterminant 
programming: 



A 



2 ' 9 H Il2 



A'^+i = argmin-logdet(A) + Ty{CA) + ^\\A-Z'' + vHI + - A'^"^ 

2 " 2 " 

^\\A^+^-Z + v% + l\ 



Z'^+i = argmin'^M'^+i - Z + P'^t, + -^\\Z - Z^r + XP{Z) 



z 

1-4-1 ~h 

V 

Ak+l yk+l 



~k+l — ^k ^ j^k + l _ ^k+1 



S.t. A^+^.Z^^' e S++ 

(5.7) 

To attack the challenge caused by non-differentiability of the exact Lq norm 
penalty, we make use of a differentiable approximation to Lo-norm penalty in 
the cost function J, named as "seamless Lq penalty" (SELO) in ^25^. The basic 
definition of this penalty term is given as: 

where Zi^j denotes individual entry in the matrix Z and r > is a tuning 
parameter. As seen in Figure 2.2 as t gets smaller, P{Zi j) approximates 



better the Lq norm I{Zij ^ 0). SELO penalty is differentiable, thus we can 
calculate the gradient of P{Z) explicitly with respect to each Zij and make 
use of first-order optimality condition to search local optimum solution. Due 
to its continuous property, it is more stable than the exact Lq norm penalty in 
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optimization. As proved in [23], the SELO penalty has the oracle property with 
proper setting of r. That's to say, the SELO penalty is asymptotically normal 
with the same asymptotic variance as the unbiased OLS estimator in terms of 
Least Square Estimation problem. 



The first two steps in ( 5.7 1 are performed with the positive definite constrains 
imposed on A and Z. The minimizing with respect to A is accomplished easily by 
performing Singular Vecto r De composition (SVD). By calculating the gradient 



of J with respect to A in (5.7), based on the first-order optimality, we derive: 



C^A-^ + ^i{A-Z'' + i)'') + j{A-A'') = (5.9) 

Based on generalized eigenvalue decomposition, it is easy to verify that 
j^k+i _ y Diag(/3)y'^, where V and {di} are the eigenvectors and eigenvalues 
of ij{Z'' — v^) — C -\- J A''. Pi is defined as: 



2(iy + 7) 

Imposing Z e S-^-j^ directly in minimizing the cost function with respect to Y 
make the optimization difficult to solve. Thus, instead, we can derive a feasible 
solution to Z by a continuous search on /i. Based on spectral decomposition, 
it is clear that X'^'^^ is guaranteed to be positive definite, while it is not nec- 
essarily sparse. In contrast, Z is regularized to be sparse while not guaranteed 
to be positive definite, /i is the regularization parameter controlling the margin 
between the estimated X''^-^ and the sparse Z^^^ . Increasingly larger [l dur- 
ing iterations makes the sequences {X^^ and {Z'"'} converge to the same point 
gradually by reducing margin between them. Thus, with enough iteration steps, 
the derived Z^ follows the positive definite constraint and sparsity constraint 
at the same time. We choose here to increase /x geometrically with a positive 
factor 77 > 1 after every A^^ iterations until its value achieves a predefined upper 
bound /^max- With this idea, the iterative DAL solution to the Lq norm penalty 
is given as: 

= argmin- log detM) + TAG A) ^\\A- Z^ ^ v^f.^ + ^\\A - A^f^, 
^ 2" 2" 

Z'^+i ^argmin^M'=+i -Z + i>'^'||^ + ^||Z-Z'=||^ + AP(Z) 
2 " 2 ' 



z 



~k+i ^~k^ ^fe+i _ ^fe+i 
/+i=min(Mr7L™,^_,) 



(5.11) 

In the second step of ( 5.11[ ), we calculate the gradient of the cost function 



with respect to Z and achieve the local minimum by performing the first-order 
optimum condition on it. Therefore, the updated value of each entry of Z is 
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given by a root of a cubic equation, as defined below: 



if Zi^j > 0, Zij is the positive root of 

2Z,/ + (3r - 29,j)Z,/ + (r^ - 3r0,.,)Z,,, - t^9,,j + = 

+ l 

if Zi,j < 0, Zij is the negative root of (5-12) 
2Z,/ ~ (3r + W^^j)Z,/ + (r^ + 3r0,,j)Z,j - r^^?,^ - = 



/^ + 7 



else Zij = 
where Z^j- is one single entry of Z and 



A* + 7 



Solving the cubic equations can be done rapidly using Cardano's formula within 
a time cost O(n^). Besides, the spectral decomposition procedure has the gen- 
eral time cost 0{n^). Given the total number of iterations K, theoretical com- 
putation complexity of DAL is 0{Kn^). For our experiments, we initialize /i to 
0.06, the multiplier factor rj to 1.3 and the regularization penalty parameter 7 
to 10"''. To approximate the Lq norm penalty, r is set to be 5 • 10"''. In our 
experiment, to derive the Pareto curve of the optimization result, we traverse 
different values of A. Most learning procedures converge with no more than 
K = 500 iteration steps. 

To validate performance of sparse inverse estimation based on the Lq norm 
penalty, we involve an alternative sparse inverse matrix learning method using 



Li norm penalization for comparison. Taking P{A) in (2.281 to be the Li ma- 
trix norm of ^4, we strengthen conditional dependence structure between random 
variables by jointly minimizing the negative log likelihood function and the Li 
norm penalty of the inverse matrix. Since Li norm penalty is strictly convex, we 
can use a quadratic approximation to the cost function to search for the global 
optimum, which avoids singular vector decomposition with complexity of 0(j>^) 
and improves the computational efficiency of this solution to 0{p), where p is 
the number of random variables in the GMRF model. This quadratic approxi- 
mation based sparse inverse matrix learning is given in [5], named as QUIC. We 
perform it directly on the empirical covariance matrix with different settings of 
the regularization coefficient A. According to works in compressed sensing, the 
equality between Li norm penalty and Lq norm penalty holds if and only if the 
design matrix satisfies restricted isometry property. However, restricted isome- 
try property is sometimes too strong in practical case. Furthermore, to our best 
knowledge, there is no similar necessary condition guaranteeing equivalence be- 
tween Li and Lq norm penalty in sparse inverse estimation problem. Therefore, 
in our case, Li norm penalized log-determinant programming is highly likely to 
be biased from the underlying sparse correlation structure in the graph, which 
leads to much denser inverse matrices. 
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6 Experiments 



In this section, various solutions based on the different methods exposed before 
are compared. We look first at the intrinsic quality, given either by the exact 
log likelihood for the Gaussian case, or by the empirical one for the Ising model, 
and then at its compatibility with belief propagation for inference tasks. 



Inverse Ising problem Let us start with the inverse Ising problem. The 
first set of experiments illustrates how the linear-response approach exposed in 
Section [2] works when the underlying model to be found is itself an Ising model. 
The quality of the solution can then be assessed directly by comparing the 



<Ji|>=0 hi=l) <Jij>={) <hi>=0 




(c) (d) 

Figure 6.1; Comparison between various approximate solutions to the inverse 
Ising problem. RMSE errors as a function of the temperature are plotted in (a) 
and (b) for the couplings J^, in (c) and (d) for the susceptibility matrix Xij 
obtained from the corresponding BP fixed point. Local fields hi are zero in (a) 
and (c) and finite but zero in average in (b) and (d). 



couplings Jij found with the actual ones. Figure 6.1 are obtained by generating 
at random 10^ Ising models of small size TV = 10 either with no local fields 
{hi = 0,Vi = l...A^) or with random centered ones hi = U[0,1] — 1/2 and 
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(a) 



(b) 



Figure 6.2: Comparison of the greedy information based MRF inference method 
with Lq and Li norm penaHzed optimizations, (a) corresponds to the Sioux-Fall 
data of size N — 60. (b) corresponds to the lAU data of size N = 1000. < K > 
is the ratio of links to nodes iV. The log likelihood on the y axes is unormalized 
and corresponds to (2.29). The reference is the log likelihood given by the full 



inverse covariance matrix. 



= (2 * U[0, 1] — 1), centered with variance J'^/N, J 



with couplings Ji. 



being the common rescaling factor corresponding to the inverse temperature. A 
glassy transition is expected at J = 1. The couplings are then determined using 
( |2.16| ), ( |2.17[ ), ( |2.22| ) and ( |2.25[ ) respectively for the mean-field, TAP, BP and 
Bethe (equivalent to susceptibility propagation) solutions. Figure 6.1 a shows 
that the Bethe approximation yields the most precise results in absence of local 
fields while it is equivalent to TAP when a local field is present as shown on 
Figure |6.1| b. Since we want to use these methods in conjunction with BP we 
have also compared the BP-susceptibilities they deliver. To do that, we simply 
run BP to get a set of belief and co-beliefs in conjunction with equation (2.241 
which after inversion yields a susceptibility matrix to be compared with the 
exact ones. The comparison shown on Figure |6.1| c indicates that Bethe and 
TAP yield the best results in absence of local field which are less robust when 
compared to the more naive BP method when local fields are present as seen on 
Figure 6.1 d. This is due to the fact that BP delivers exact beliefs when model 
( 2.22|2.21 ) is used, which is not necessarily the case for other methods when the 
local fields are non-vanishing. It is actually not a problem of accuracy but of 
BP compatibility which is raised by this plot. 



Sparse inverse models Let us now test the approach proposed in Section [372] 
to build a model link by link for comparison with more conventional optimization 
schema based on Lo and Li penalizations. We show the results of tests only for 
the Gaussian case where the link surgery can be treated exactly. In the Ising 
case the method can be used only marginally to propose little correction on the 
maximum spanning tree or any other sparse model. In general we cannot expect 
the method to be able to compete with the inverse susceptibility propagation 
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schema i.e. what we call here the Bethe inverse model (2.25). The reason is that 
the LL gain given by one link is more costly to assess than in the Gaussian case 
and it is also only approximate. So the stability of the schema is more difficult to 
control when many links have to be added because the condition of the validity 
of the Bethe approximation are not controlled without paying an additional 
computational price. For the Gaussian case instead the situation is much more 
favorable because the gain can be computed exactly with low computational 
cost even when the graph is dense. The test we show on Figure |6.2| are done 
on simulated data produced by the traffic simulator METROPOLIS [9], our 
original motivation for this work being related to traffic inference [121 H] • The 
first set corresponds to a small traffic network called Sioux-Fall consisting of 
72 links, from which we extract the N = 60 most varying ones (the other one 
being mostly idle). The second set (lAU) is obtained for a large scale albeit 
simplified network of the Paris agglomeration of size 13626 links, out of which 
we extracted a selection of the N = 1000 most varying ones. Each sample data 
is a A^-dimensional vector of observed travel times {ti,i — 1 . . . N}, giving a 
snapshot of the network at a given time in the day. The total number of samples 
is S' = 3600 for Sioux-Falls and S = 7152 for lAU, obtained by generating many 
days of various traffic scenarios. Then for each link the travel time distribution is 
far from being Gaussian, having heavy tails in particular. So to deal with normal 
variables (if taken individually) we make the following standard transformation: 

y^^FGaussMU), V* - 1 . . . TV (6.1) 

which map the travel time to a genuine Gaussian variable yi, where Fi and 
Fcauss are respectively the empirical cdf of U and of a centered normal variable. 
The input of the different algorithms under study is then the covariance matrix 
Cov(yi,yj). This mapping will actually be important in the next section when 
using the devised MRF for inference tasks. Figure [6^ displays the comparison 
between various methods. Performances of the greedy method are comparable 
to the Lq penalized optimization. To generate one solution both methods are 
comparable also in term of computational cost, but the greedy is faster in very 
sparse regime, and since it is incremental, it generate a full Pareto subset for 
the cost of one solution. On this figure we see also that the Li method is simply 
not adapted to this problem. From the figures, we can see that the estimated 
inverse matrix derived based on Li norm penalty needs distinctively more non- 
zero entries to achieve similar log-likelihood level as the Lq penalty, indicating 
its failure of discovering the underlying sparse structure, the thresholding of 
small non-zero entries being harmful w.r.t. positive definitness. The reason 
might be that is adapted to situations where a genuine sparse structure exists, 
which is not the case in the present data. 



Inverse models for inference We turn now to experiments related to the 
original motivation of this work, which is to use calibrated model for some 
inference tasks. The experiments goes as follows: we have an historical data 
set consisting of a certain number of samples, each one being a A^-dimensional 
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(a) (b) 

Figure 6.3: Comparison of decimation curves in the case of a multi-modal dis- 
tribution with five cluster for N = 500 variables (a). Projection of the dataset 
in the 3-d dominant PCA space along with the corresponding BP fixed points 
projections obtained with the Ising model. 



variable vector, say travel times, which serves to build the models Given a 
sample test data we want to infer the (1 — p)N hidden variables when a certain 
fraction p of the variables are revealed. In practice we proceed gradually on 
each test sample by revealing one by one the variables in a random order and 
plot as a function of p the Li error made by the inference model on the hidden 
variables. Both for Ising and Gaussian MRP, the inference is not performed 
in the original variable space, but in an associated one obtained through a 
mapping (a traffic index) using the empirical cumulative distribution of each 
variable. For the Gaussian model the inference is performed in the index space 



defined previously by (6.1). For the Ising models we have studied a variety 
of possible mapping j27l in order to associate a binary variable to a real one 
such that a belief associated to a binary state can be converted back into a 
travel time prediction. Without entering into the details (see [23 for details), 
to define in practice this binary state Ui, either we make use of the median value 
^1/2 _ ^-1(1/2) in the distribution of Xi for alH = 1 . . . N: 

Either we perform a soft mapping using the cdf: 

P{a, = 1) = F,{xi) (li), 

the last one having the advantage of being functionally invertible if is 
defined, while the former one being inverted using Bayes rule. The data we are 
considering are "low temperature" data in the sense that correlations are too 
strong for an Ising model with one single fixed point. This is rcfiected in the fact 



^In fact the pairwise MRF models exploit only pairwise observations but for sake of com- 
parison with a KNN predictor we generate complete historical samples data. 
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that none of the basic methods given in the Section [2] is working. To overcome 
this we use a simple heuristic which consists in to add a parameter a 6 [0, 1] 
in the BP model like e.g. in (2.26) or to multiply the Jij by a for the MF, 
TAP and Bethe models, the local field being consequently modified owing to 
their dependency on the Jij. Concerning the factor-graph we have considered 
various graph selection procedures. All are based on the mutual information 
given empirically between variables. A global/local threshold can be used to 
construct the graph, the parameter being the mean/local connectivity K; the 
MST can be used conveniently as a backbone and additional links are obtained 
through the thresholding selection procedures. These two parameter a and K 
are calibrated such as to optimize the performance for each type of model so 
that fair comparisons can be made afterward. One important difference between 
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Figure 6.4: Comparison of decimation curves between various MRF for Sioux- 
Falls data (a) and lAU data (b) 



the Ising model and the Gaussian one is that multiple fixed points may show 
up in the Ising case while only a single one, stable or not stable, is present 
in the Gaussian case. This can be an advantage in favor of the Ising model 
when the data have well separated clusters. Figure |6.3| illustrates this point. 
The data are sampled from a distribution containing 5 modes, each one being a 
product form over TV random bimodal distributions attached to each link. On 
Figure |6.3|a which displays the error as a function of the fraction of revealed 



variables we see that the Ising model obtained with (2.261, encoded with the 



median value (i), gives better prediction than the exact Gaussian model or the 



approximated GaBP compatible one. Indeed Figure 6.3 b shows a projection of 
the data in the most relevant 3-d PCA space along with the projected position 
of BP fixed points (given by their sets of beliefs) delivered by the Ising model. 
As we see, the model is able to attach one BP fixed point to each component 
of the distribution. Ideally we would like a perfect calibration of the Ising 
model in order that these fixed points be located at the center of each cluster. 
The method proposed in Section |4] could help to do this, but has not been 
implemented yet. On Figure [6?3l a we see also that the knn predictor performs 
optimally in this case, since the error curve coincides exactly with the one given 
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by the hidden generative model of the data (ground truth). Figure 6.4 shows 
how the different models compare on the data generated by the traffic simulator. 
On the Sioux-Falls network, the Gaussian model gives the best results, and a 
sparse version obtained with the greedy algorithm of section [3^ reach the same 
level of performance and outperforms knn. The best Ising model is obtained 
with the (2.26) with type (ii) encoding. For lAU the full Gaussian model is also 
competitive w.r.t KNN, but the best sparse GaBP model is not quite able to 
follow. In fact the correlations are quite high in this data, which explain why 
the best Ising model shows very poor performance. The best Ising model in 
that case corresponds to the plain BP model with type (ii) encoding and MST 
graph. 



7 Conclusion 

This paper is based on the observation that the Bethe approximation can be 
in many case a good starting point for building inverse models from data ob- 
servations. We have developed different ways of perturbing such a mean-field 
solution valid both for binary and Gaussian variables, and leading to an effi- 
cient algorithm in the Gaussian case to generated sparse approximation models 
compatible with BP. The additional requirement that the model be compatible 
with BP for large scale application discards dense models and simplifies in a 
way the search space on model selection. More experimental tests on various 
data should help to refine and settle the general methods proposed here. 
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